ABSTRACT

In many emerging applications, data streams are monitored in a
network environment. Because communication bandwidth and other
resources are limited, a critical and practical demand is to compress
data streams online and continuously with a quality guarantee.
Although many data compression and digital signal processing methods
have been developed to reduce data volume, their super-linear time and
more-than-constant space complexity prevents them from being applied
directly to data streams, particularly over resource-constrained
sensor networks. In this paper, we tackle the problem of online,
quality-guaranteed compression of data streams using fast linear
approximation (i.e., using line segments to approximate a time
series). Technically, we address two versions of the problem, which
express the quality guarantee in different forms. We develop online
algorithms with linear time complexity and constant space cost. Our
algorithms are optimal in the sense that they generate the minimum
number of segments that approximate a time series with the required
quality guarantee. To meet the resource constraints in sensor
networks, we also develop a fast algorithm that creates connecting
segments with very simple computation. Their low cost makes our
methods well suited to massive, fast streaming environments,
low-bandwidth networks, and nodes with heavily constrained
computational power. We implement and evaluate our methods in an
acoustic wireless sensor network.

Categories and Subject Descriptors 

C.3 [Computer Systems Organization]: Special-Purpose and
Application-Based Systems; G.1.2 [Mathematics of Computing -
Numerical Analysis]: Approximation - Linear approximation

General Terms 

Algorithms, Design, Performance 

Keywords 

Wireless Sensor Networks, Data Streaming, Linear Approximation 

1. INTRODUCTION 

In many emerging applications, massive data streams are mon- 
itored in a network environment. For example, large sensor net- 
works are extensively used in wildlife monitoring, road traffic mon- 
itoring, and environment surveillance. Each sensor generates a data 
stream where new data entries (i.e., new readings) keep arriving in 
a continuous manner. In order to aggregate and analyze the massive 
streaming data under monitoring, it is often required to transmit the 
data streams in the network. Because communication bandwidth and
other resources are often limited, compressing data streams online and
continuously with a quality guarantee arises as a natural, critical,
and practical demand in those applications.

Example 1 (Motivation). We, the authors of this paper, 
are building an acoustic monitoring system using wireless sensor 
networks. Sensor nodes are deployed in a target area, while each 
node contains an acoustic sensor which samples sound signals con- 
tinuously. The sensor nodes are connected by a wireless network. 

The acoustic monitoring system has many applications. An appealing
scenario is a "smart conference hall." By analyzing the data collected
from an acoustic monitoring system deployed in a large conference
venue, we can identify and locate speakers as well as some of their
activities. The information can be used to adjust equipment such as
the lighting system, the microphone system, the video monitoring
system, and the air conditioning system. Another potential application
is bird surveillance in the wild. By analyzing the bird sounds
collected using such a sensor network, ornithologists can study the
distribution of birds and their behavior patterns.

Wireless sensor nodes, which integrate sensors, processors, memory,
and wireless transceivers, are often small and have only very limited
computational power and communication bandwidth. For instance, the
Chipcon radio chip in the widely used MICA2 motes [15] has a maximum
transmission power of 27 mA and a maximum bandwidth of 38 kbps.

In our acoustic monitoring system, we use MICA2 motes. One technical
challenge is that, although a sensor can sample the acoustic signals
frequently, the acoustic data stream cannot be sent out in time due to
the low-bandwidth radio channel. Specifically, in order to make the
data analysis useful, we need to sample human voice at the normal
sampling rate of 8 kHz with 16 bits per sample. This sampling mode
requires a bandwidth of 128 kbps for 1-channel (mono) voice, which
greatly exceeds the maximum bandwidth of 38 kbps that a MICA2 mote can
support. In addition, we cannot temporarily store a large number of
samples since the memory size of MICA2 motes is only 512 kb. The only
technical solution to the bottleneck is to compress data streams
online and continuously, and to send out the compressed streams
instead of the original streams through the network. Sending
compressed streams can also reduce the power that sensors spend on
communication, and thus extend the lifetime of sensors. In large
environmental surveillance sensor networks, recharging or replacing
batteries of sensor nodes is often very difficult or even impossible
after the sensors are deployed. ■

Many data compression and digital signal processing methods have been
developed to reduce data volume, such as the Fourier transform [17],
the discrete cosine transform [14], wavelets [2], linear predictive
coding (LPC) [1], etc. However, those methods cannot be applied to
data stream compression in sensor networks due to their high cost in
time and space. Moreover, sensor nodes like MICA2 motes only have very
limited computational power. For example, only simple arithmetic
operations are supported by TinyOS [3], the operating system for MICA2
motes. Although it is possible to implement a mathematical module to
calculate essential functions like sinusoid and exponential functions,
or to use dedicated DSP chips for audio processing and compression,
such complex modules are highly undesirable due to the limited memory
size and computational capacity of MICA2 motes as well as the extra
energy cost of dedicated DSP chips.

In this paper, we tackle the problem of online compression of 
data streams in the application context of sensor networks. Partic- 
ularly, we aim at the fast linear approximation methods (i.e., using 
line segments to approximate a time series) with quality guarantee. 
We make the following contributions. 

First, we model the piecewise linear approximation problem prop- 
erly for data streams. Different from the conventional situations 
where the whole time series to be compressed and the required 
compression rate can be specified, a data stream is potentially un- 
limited, and the distribution is often unpredictable. We propose the 
error-bounded piecewise linear approximation problem to tackle 
those challenges. Second, we present fast online solutions with lin- 
ear time complexity and constant cost in space. Our algorithms are 
optimal in the number of segments used to approximate a (poten- 
tially unlimited) time series. In other words, our algorithms create 
the minimum number of line segments even without knowing the 
future incoming data. To the best of our knowledge, we are the 
first to successfully devise algorithms with such strong guarantees. 
Third, to address the computational challenges in sensor nodes, we 
develop another online approximation algorithm that is particularly 
tailored for tiny sensor devices by requiring only very simple com- 
putation. The low cost nature of our methods leads to a unique 
edge on the applications of massive and fast streaming environ- 
ment, low bandwidth networks, and heavily constrained nodes in 
computational power (e.g., tiny sensor nodes). Last, we implement 
and evaluate our methods in the application of an acoustic wire- 
less sensor network. Our empirical evaluation clearly shows that 
our methods are highly feasible for resource-constrained wireless 
sensor networks. 

The rest of the paper is organized as follows. In Section 2, we
formulate and analyze the problem, and review the related work. Two
online algorithms are developed in Section 3, and their optimality is
studied in Section 4. In Section 5, we design an online approximation
algorithm that is more economical in computation for tiny sensors. We
report our implementation and evaluation of the proposed methods in an
acoustic wireless sensor network in Section 6. The paper is concluded
in Section 7.

2. PROBLEM DEFINITION AND RELATED 
WORK 

In this section, we propose the error-bounded piecewise linear 
approximation problem for data streams. We also review the related 
work. 

2.1 Problem Formulation 

Piecewise linear approximation (PLA) is an effective method to 
compress a time series. A numeric data stream can be treated as 
a potentially unlimited time series. Thus, it is natural to explore 
whether we can compress a numeric data stream using the piece- 
wise linear approximation method. 

Let $X = x_1 \cdots x_n$ be a time series of $n$ points, and $x_i$
($1 \le i \le n$) be the value of the $i$-th point of $X$. A (line)
segment is a tuple $s = ((i, y_i), (j, y_j))$ where $i < j$ and
$(i, y_i)$ and $(j, y_j)$ are the two endpoints. $[i, j]$ is called
the range of $s$.

Figure 1: Piecewise linear approximation.

Given a time series $X$, PLA uses a set of line segments as the
approximation of the time series. Figure 1 illustrates the general
idea, where three line segments, $AA'$, $BB'$, and $CC'$, are used to
approximate a time series. A line segment $s = ((i, y_i), (j, y_j))$
approximates the $k$-th point ($i \le k \le j$) of the time series by
the value

$$\hat{x}_k = y_i + \frac{k - i}{j - i}\,(y_j - y_i).$$

The compression comes from the fact that the number of line segments
used for approximation can be much smaller than the number of points
in the time series. In the figure, the time series has 18 points,
three segments are used to approximate the time series, and each
segment has 2 endpoints. Thus, the 3 line segments need only 6 points
to represent them, and a compression ratio of 18/6 = 3 is achieved.
Generally, the endpoints of the segments are not necessarily
positioned at points of the time series (e.g., $B$, $B'$, and $C'$ in
the figure).
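The interpolation formula and the compression ratio above can be checked numerically. The following is a minimal sketch; the function names `segment_value` and `compression_ratio` are ours and are introduced only for illustration.

```python
from typing import Tuple

Segment = Tuple[Tuple[int, float], Tuple[int, float]]  # ((i, y_i), (j, y_j))


def segment_value(seg: Segment, k: int) -> float:
    """Value that segment ((i, y_i), (j, y_j)) assigns to index k, i <= k <= j."""
    (i, yi), (j, yj) = seg
    return yi + (k - i) / (j - i) * (yj - yi)


def compression_ratio(num_points: int, num_segments: int) -> float:
    """Ratio of original points to stored endpoints (2 endpoints per segment)."""
    return num_points / (2 * num_segments)


# The example of Figure 1: 18 points approximated by 3 segments.
print(compression_ratio(18, 3))
```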

Formally, a set of segments $\hat{X} = \{s_1, \ldots, s_m\}$ is a
piecewise linear approximation of $X$ if (1) $s_1, \ldots, s_m$ are
segments; and (2) for each index $i$ ($1 \le i \le n$), $i$ is either
in the range of exactly one segment in $\hat{X}$, or there exist two
segments $s, s' \in \hat{X}$ such that $s$ and $s'$ share the same
endpoint at index $i$. Clearly, using the segments, for every index
$i$, $\hat{X}$ gives a value $\hat{x}_i$ to approximate $x_i$.

PLA for static time series has been well studied (e.g., [5, 6, 8,
12]). Most of the previous studies address an optimization problem as
follows.

Problem 1 (Conventional PLA problem). Given a time series $X$ of $n$
points and a number $m < n$, find a set of $m$ segments as a piecewise
linear approximation of $X$ such that the approximation error is
minimized. ■

Unfortunately, solutions to the conventional PLA problem are 
not applicable to data streams. A data stream is potentially unlim- 
ited. It is impossible to know in advance the number of points in the 
stream or to specify the number of segments to be used for approx- 
imation. To tackle the stream compression problem, in this paper, 
we turn to the error-bounded PLA problem. 

Problem 2 (Error-bounded PLA problem). Let $err()$ be an error
measurement function such that $err(X, \hat{X})$ gives the error with
which a PLA $\hat{X}$ approximates $X$, and let $e$ be a user-specified
error bound. $\hat{X}$ is called an $e$-PLA of $X$ if
$err(X, \hat{X}) \le e$. An $e$-PLA $\hat{X}$ of $X$ is optimal if
$|\hat{X}|$ (i.e., the number of segments in $\hat{X}$) is minimized. ■

We propose two error measurement functions meaningful for 
data streams. 



First, the max-err function captures the maximal error between $X$ and
$\hat{X}$ at any index. That is,

$$maxerr(X, \hat{X}) = \max_{i=1}^{n} \{|x_i - \hat{x}_i|\}$$

With potentially unlimited streams, using the max-err function, we can
make sure the approximation quality is consistently bounded at every
point.

Second, the seg-err function checks the error introduced by each
segment, measured as the sum of squared errors over the points in the
range of the segment, and captures the maximal error. That is,

$$segerr(X, \hat{X}) = \max_{s \in \hat{X}} \Big\{ \sum_{i \in range(s)} (x_i - \hat{x}_i)^2 \Big\}$$

Using the seg-err function, we can make sure that the error introduced
by every segment is bounded.
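The two error measures can be evaluated directly for a finite time series. The sketch below assumes 1-based indices as in the text, assumes that seg-err sums the squared errors over the range of each segment, and uses helper names of our own choosing.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[Tuple[int, float], Tuple[int, float]]  # ((i, y_i), (j, y_j))


def value_at(seg: Segment, k: int) -> float:
    """Approximated value at index k given by one segment."""
    (i, yi), (j, yj) = seg
    return yi + (k - i) / (j - i) * (yj - yi)


def maxerr(xs: Sequence[float], pla: List[Segment]) -> float:
    """Largest absolute error at any index (xs[0] is x_1)."""
    return max(abs(xs[k - 1] - value_at(s, k))
               for s in pla
               for k in range(s[0][0], s[1][0] + 1))


def segerr(xs: Sequence[float], pla: List[Segment]) -> float:
    """Largest per-segment sum of squared errors."""
    return max(sum((xs[k - 1] - value_at(s, k)) ** 2
                   for k in range(s[0][0], s[1][0] + 1))
               for s in pla)
```

For instance, approximating the series 0, 1, 2.5, 3 by the single segment ((1, 0), (4, 3)) gives a max-err of 0.5 and a seg-err of 0.25.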

Using the two error measurement functions, we have two ver- 
sions of the error-bounded PLA problem. 

Problem 3 (PLA-PointBound problem). Given an error bound $e$, the
PLA-PointBound problem is to find an $e$-PLA $\hat{X}$ such that
$maxerr(X, \hat{X}) \le e$ and $|\hat{X}|$ is minimized. ■

Problem 4 (PLA-SegmentBound problem). Given an error bound $e$, the
PLA-SegmentBound problem is to find an $e$-PLA $\hat{X}$ such that
$segerr(X, \hat{X}) \le e$ and $|\hat{X}|$ is minimized. ■

2.2 Related Work 

Piecewise linear approximation (PLA) has been well investigated 
in O |7l [U El [TU Qll. The idea behind PLA comes from the fact 
that a sequence of line segments can be used to represent the time 
series while preserving a low approximation error. Standard linear 
regression technique is widely used in most existing piecewise lin- 
ear approximation algorithms to calculate a line segment approx- 
imating the original data with the minimum mean squared error. 
Many of them [5, 6, 8, 12] target the conventional PLA problem and may
not be applicable to streaming data.

Despite the substantial research efforts in PLA techniques (5][U 
IZHUM 12], existing solutions are not tailored for data streams over 
resource-constrained sensor networks. They either require complex 
computation or have high cost in space. To the best of our knowledge,
there has been no implementation of these algorithms on realistic
sensor devices.

In [9], the authors use PLA to estimate a time series. However, they
place an unnecessary constraint on the algorithm by requiring that the
endpoints come from the original dataset. On the whole, their
algorithm runs in $O(n \log n)$ time and takes $O(n)$ space.

In [7], Keogh et al. give a comprehensive review on the existing
techniques for segmenting time series. They categorize the solu- 
tions into three different groups, namely sliding window methods, 
top-down methods, and bottom-up methods. They then take advan- 
tage of both sliding window and bottom-up methods and design a 
Sliding-Window-And-Bottom-up (SWAB) algorithm. The SWAB 
algorithm uses a moving window to constrain a time period in con- 
sideration. 

In [11], an amnesic function is introduced to give weights to
different points in the time series. The PLA-SegmentBound problem is
discussed in the context of the Unrestricted Window with Absolute
Amnesic (UAA) problem, but complete solutions to this problem are not
provided in [11].

A solution to the PLA-PointBound problem is addressed in [10] with a
different definition of the point error bound. The algorithm is
claimed to be optimal, but the time complexity is $O(n^3)$ where $n$
is the number of points in the time series. Moreover, no performance
evaluation of the solution is presented in the paper.

In summary, although the error-bounded PLA problem has been 
investigated before, the problem has not been studied systemati- 
cally. No solutions applicable to data streams have been developed, 
let alone solutions for resource-constrained sensor networks. 

3. ONLINE ALGORITHMS 

In this section, we develop two online algorithms for the PLA- 
PointBound and the PLA-SegmentBound problems, respectively. 
The two algorithms share the same framework. 

3.1 The Framework 

The framework of our algorithms works in a greedy manner. When $x_1$,
the first point in the stream, arrives, we store $x_1$. When $x_2$
arrives, we also store $x_2$ since $x_1$ and $x_2$ can be compressed
by a segment exactly. When $x_3$ arrives, we check whether $x_3$ can
be compressed together with $x_1$ and $x_2$ by a line segment
satisfying the error-bound requirement. If so, we store $x_3$.
Otherwise, we output a line segment compressing $x_1$ and $x_2$,
remove $x_1$ and $x_2$ from the main memory, and store $x_3$.

Generally, imagine we have a buffer in main memory storing points
$x_i, x_{i+1}, \ldots, x_j$ such that the points in the buffer can be
compressed by a line segment satisfying the error-bound requirement.
When a new point $x_{j+1}$ arrives, we check whether $x_{j+1}$ can be
compressed together with $x_i, \ldots, x_j$ by a line segment
satisfying the error-bound requirement. If so, we add $x_{j+1}$ to the
buffer and move on to the next point. Otherwise, we output a segment
compressing $x_i, \ldots, x_j$ satisfying the error-bound requirement,
and remove them from the buffer. $x_{j+1}$ is then stored in the
buffer.
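The greedy framework can be sketched with an explicit buffer and a pluggable fitting test. This is only an illustration under our own assumptions: `greedy_segments` and `chord_fits` are hypothetical names, and `chord_fits` is a deliberately naive test (the chord from the first to the last buffered point, rechecked against every buffered point), not the test used by the algorithms of this paper. It keeps all buffered points and revisits them on every arrival.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def greedy_segments(stream: Iterable[float],
                    fits: Callable[[List[float]], bool]
                    ) -> Iterator[Tuple[int, List[float]]]:
    """Yield (start_index, points) for each run of points closed by the greedy rule."""
    start, buf = 0, []
    for x in stream:
        # One or two points can always be compressed exactly.
        if len(buf) < 2 or fits(buf + [x]):
            buf.append(x)
        else:
            yield start, buf          # close the current segment
            start += len(buf)
            buf = [x]                 # the new point opens the next buffer
    if buf:
        yield start, buf


def chord_fits(e: float) -> Callable[[List[float]], bool]:
    """Naive test: does the chord first-to-last stay within e of every point?"""
    def check(pts: List[float]) -> bool:
        slope = (pts[-1] - pts[0]) / (len(pts) - 1)
        return all(abs(p - (pts[0] + slope * k)) <= e for k, p in enumerate(pts))
    return check


runs = list(greedy_segments([0, 1, 2, 3, 10, 11, 12], chord_fits(0.5)))
```

In this sketch the buffer grows with the length of a run and every arrival rescans it, which is the linear space and quadratic time behavior that a streaming algorithm has to avoid.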

Although the framework is simple, there are two critical issues 
that need to be solved carefully in order to make sure that the 
runtime of the algorithms is linear with respect to the number of 
points in the streams, and the space size needed by the algorithms 
is bounded by a constant. 

First, how can we store the information about the points we have seen
but have not compressed? In the worst case, there can be an unlimited
number of such points (e.g., a time series where all points take the
same value). How can we summarize them using only constant-size
memory?

Second, how can we determine whether a newly arrived point 
can be compressed together with the points already in the buffer 
that have been seen but have not been compressed? Revisiting those 
points one by one leads to the runtime quadratic with respect to 
the number of such points. As explained before, there can be an 
unlimited number of such points. The overall time complexity is 
quadratic if those points are revisited one by one. 

Our central idea to tackle the above two challenges is the follow- 
ing. Instead of storing the points explicitly, we monitor the range of 
all possible line segments that can be used to compress the points 
that have been seen but have not been compressed in a concise way. 
When a new point arrives, we can check whether the point can be 
compressed using some line segment in the range. If so, it means 
that the new point can be compressed together with the points ac- 
cumulated. We only need to adjust the range of the possible line 
segments to make sure the new point is also compressed. If not, it 
means that the new point cannot be compressed together with the 
points accumulated. A segment should be output. 

3.2 Solving the PLA-PointBound Problem 

A segment $s = ((i, y_i), (j, y_j))$ can also be represented by the
left endpoint $(i, y_i)$, the slope $m$, and the index $j$ of the
right endpoint.

Figure 2: Ranges of possible line segments.

Figure 3: Polygon $poly(i, j)$.

For two points $x_i$ and $x_j$ in a data stream, if a line segment
$s = ((i, y_i), (j, y_j))$ with slope $m = \frac{y_j - y_i}{j - i}$
can approximate $x_i$ and $x_j$, i.e., $|x_i - y_i| \le e$ and
$|x_j - y_j| \le e$ where $e$ is the error bound, $s$ must satisfy the
following four conditions.

$$x_i - e \le y_i \le x_i + e \tag{1}$$

$$m_1 = \frac{(x_j + e) - y_i}{j - i} \tag{2}$$

$$m_2 = \frac{(x_j - e) - y_i}{j - i} \tag{3}$$

$$m_2 \le m \le m_1 \tag{4}$$

Figure 2 illustrates the conditions and their relations. In
particular, $m_1$ and $m_2$ are the slopes of the two lines shown in
the figure.
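The four conditions amount to a simple membership test on the pair $(y_i, m)$. The sketch below assumes the slope bounds $m_1 = ((x_j + e) - y_i)/(j - i)$ and $m_2 = ((x_j - e) - y_i)/(j - i)$; the function names are ours.

```python
from typing import Tuple


def slope_bounds(i: int, j: int, xj: float, yi: float, e: float) -> Tuple[float, float]:
    """Return (m2, m1): the smallest and largest slope that keeps the segment
    within e of x_j when its left endpoint is (i, yi)."""
    m2 = ((xj - e) - yi) / (j - i)
    m1 = ((xj + e) - yi) / (j - i)
    return m2, m1


def in_poly(i: int, xi: float, j: int, xj: float,
            yi: float, m: float, e: float) -> bool:
    """True if the segment with left value yi and slope m is within e of both
    x_i and x_j."""
    if not (xi - e <= yi <= xi + e):
        return False
    m2, m1 = slope_bounds(i, j, xj, yi, e)
    return m2 <= m <= m1
```

For example, with $x_1 = 0$, $x_3 = 2$, $e = 0.5$ and $y_1 = 0$, the admissible slopes are those in $[0.75, 1.25]$.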

Since a line segment is determined by the value of the left endpoint
$y_i$ and the slope $m$, we examine the distribution of points
$(y_i, m)$ that satisfy Equations 1 to 4. As illustrated in Figure 3,
the possible line segments form a polygon $poly(i, j)$. We have the
following important result.

LEMMA 1 (PLA-POINTBOUND). A line segment with left endpoint value
$y_i$ and slope $m$ can approximate points $x_i, \ldots, x_j$ with
max-err at most $e$ if and only if $(y_i, m)$ is in polygon
$poly(i, i+1) \cap poly(i, i+2) \cap \cdots \cap poly(i, j)$.

Proof. The necessity follows from the definition of $poly(i, j)$. For
any line segment
$s \notin poly(i, i+1) \cap poly(i, i+2) \cap \cdots \cap poly(i, j)$,
there exists an index $k$ ($i < k \le j$) such that
$s \notin poly(i, k)$, i.e., $s$ cannot approximate either $x_i$ or
$x_k$.

We prove the sufficiency by contradiction. Suppose a segment
$s \in poly(i, i+1) \cap poly(i, i+2) \cap \cdots \cap poly(i, j)$ but
$s$ cannot approximate $x_k$ ($i \le k \le j$). Two situations may
arise. First, $k = i$. Then, $s \notin poly(i, i+1)$ since
$|x_i - y_i| > e$ where $y_i$ is the value of $s$ on index $i$.
Second, $k \ne i$. Then, $s \notin poly(i, k)$. In both cases, we have
contradictions. ■

Input: a data stream X = x_1, x_2, ... and error bound e;
Output: a list of line segments X^ approximating X
        such that maxerr(X, X^) <= e;
Method:
    P = poly(1, 2); i = 1; j = 3;
    WHILE (1) DO {
        P' = P ∩ poly(i, j);
        IF P' ≠ ∅ THEN { P = P'; j = j + 1; }
        ELSE {
            randomly choose a point (y, m) in P;
            /* any point in P meets the point error bound */
            output a line segment ((i, y), (j - 1, y + (j - i - 1) * m));
            P = poly(j, j + 1); i = j; j = j + 2;
        }
    }

Figure 4: PointBound, an online algorithm for the PLA-PointBound
problem.

Using Lemma 1, we have algorithm PointBound, an online algorithm as
shown in Figure 4. We maintain the intersection of polygons
$poly(i, i+1), \ldots, poly(i, j)$, where $x_i$ is the first point
that has not been compressed yet in the data stream, and $x_j$ is the
last point that has arrived such that
$poly(i, i+1) \cap \ldots \cap poly(i, j) \ne \emptyset$.

When a new point $x_{j+1}$ arrives, we compute $poly(i, j+1)$ and
$poly(i, i+1) \cap \ldots \cap poly(i, j) \cap poly(i, j+1)$. If it is
$\emptyset$, then a line segment $s$ is randomly chosen to approximate
$x_i, \ldots, x_j$ such that $(y_i, m)$ is in
$poly(i, i+1) \cap \ldots \cap poly(i, j)$, where $y_i$ is the value
of $s$ on index $i$, and $m$ is the slope of $s$. $s$ is output, and
the intersection of polygons is removed. $x_{j+1}$ and $x_{j+2}$ are
used to generate a new polygon $poly(j+1, j+2)$.

If $poly(i, i+1) \cap \ldots \cap poly(i, j) \cap poly(i, j+1) \ne
\emptyset$, then the intersection is kept, and the algorithm moves on
to the next point in the stream.
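The procedure can be sketched in Python by representing the intersection polygon explicitly in the $(y_i, m)$ plane and clipping it with the two sloping constraints of each new point. This sketch rests on several assumptions of ours: it stores every vertex of the intersection, so it is not constant in space; it picks the average of the vertices instead of a random point; it uses 0-based indices and a finite list; and the names `clip` and `point_bound` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]                      # (y, m) in the parameter plane
Segment = Tuple[Tuple[int, float], Tuple[int, float]]


def clip(poly: List[Point], a: float, b: float, c: float) -> List[Point]:
    """Keep the part of a convex polygon where a*y + b*m <= c."""
    out: List[Point] = []
    for k in range(len(poly)):
        p, q = poly[k], poly[(k + 1) % len(poly)]
        fp = a * p[0] + b * p[1] - c
        fq = a * q[0] + b * q[1] - c
        if fp <= 0:
            out.append(p)
        if (fp < 0 < fq) or (fq < 0 < fp):       # edge crosses the boundary
            t = fp / (fp - fq)
            out.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return out


def point_bound(xs: Sequence[float], e: float) -> List[Segment]:
    """Greedy segmentation of xs with max-err at most e (up to rounding)."""
    segments: List[Segment] = []
    i, n = 0, len(xs)
    while i < n:
        if i == n - 1:                           # a lone trailing point
            segments.append(((i, xs[i]), (i, xs[i])))
            break
        lo, hi, nxt = xs[i] - e, xs[i] + e, xs[i + 1]
        # poly(i, i+1): lo <= y <= hi and nxt - e <= y + m <= nxt + e
        P = [(lo, nxt - e - lo), (lo, nxt + e - lo),
             (hi, nxt + e - hi), (hi, nxt - e - hi)]
        j = i + 2
        while j < n:
            d = j - i
            Q = clip(P, 1.0, d, xs[j] + e)        # y + m*d <= x_j + e
            Q = clip(Q, -1.0, -d, -(xs[j] - e))   # y + m*d >= x_j - e
            if not Q:                             # empty intersection
                break
            P, j = Q, j + 1
        # Any point of the convex polygon works; take the vertex average.
        y = sum(p[0] for p in P) / len(P)
        m = sum(p[1] for p in P) / len(P)
        segments.append(((i, y), (j - 1, y + (j - 1 - i) * m)))
        i = j
    return segments
```

On the series 0, 1, 2, 3, 10, 11, 12 with $e = 0.5$, this sketch closes the first segment after the fourth point, because no single line stays within 0.5 of both the first four points and the value 10.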

For any $i$ and $j$, $poly(i, j)$ is a parallelogram with two edges
parallel to the slope axis. It is easy to show that for any $i$ and
$j$, $\bigcap_{k=i+1}^{j} poly(i, k)$ is a convex polygon. In the
worst case, the intersection of parallelograms could have up to
$2(j - i + 1)$ edges, i.e., about twice the number of parallelograms
intersected. A straightforward method keeping all edges of the
intersection area still has quadratic time complexity and linear space
complexity, which is not applicable to data streams.

Fortunately, we do not need to record all edges of the intersection 
polygon. Instead, we need to record only up to 4 edges to determine 
whether a new point can be compressed together with the points 
seen but not compressed. 

Using Equations 1 to 4, it is easy to see that each parallelogram has
two properties: (1) Each parallelogram has two vertical edges and two
sloping edges with a negative slope value, as shown in Figure 3. The
range of $y_i$ is the same for all parallelograms (i.e.,
$x_i - e \le y_i \le x_i + e$). (2) For $j_2 > j_1 > i$, the absolute
slope value of the two sloping edges in $poly(i, j_2)$ is strictly
smaller than the absolute slope value of the two sloping edges in
$poly(i, j_1)$, since by Equations 2 and 3 the sloping edges of
$poly(i, j)$ have slope $-1/(j - i)$ in the $(y_i, m)$ plane.

Let us focus on the intersection points of the upper sloping edge 
of parallelograms. The case for the lower sloping edges can be 
analyzed similarly. 

The situations are illustrated in Figure 5. Suppose that the first
parallelogram gives the upper sloping edge $AB$ with slope value
$m_{AB}$ as in Figure 5(a). When a new data point arrives, a new
parallelogram is formed. In the worst case, the upper sloping edge
$CD$ of the new parallelogram cuts $AB$ into two parts. Let $E$ be the
intersection point between $AB$ and $CD$, as shown in Figure 5(b).

Figure 5: Using up to 4 edges to represent the intersection polygon.

By the second property, we have $|m_{CD}| < |m_{AB}|$. Moreover, the
upper sloping edge $FG$ of any future parallelogram cannot cut both
$CE$ and $EB$, because the absolute slope value of $FG$ is smaller
than $|m_{CD}|$. In other words, if a future parallelogram intersects
with the current intersection polygon, the upper sloping edge of the
parallelogram can only cut either $CE$, $EB$, or the right vertical
edge. Instead of keeping $CE$ and $EB$, we can keep line segment $CB$.
Then, a future parallelogram intersects with the current intersection
polygon if and only if it cuts $CB$.

Generally, we only need to keep the line segment connecting the 
left-most upper corner and the right-most upper corner for the up- 
per sloping edges. Similarly, we only need to keep the line segment 
connecting the left-most lower corner and the right-most lower cor- 
ner for the lower sloping edges. 

In addition to these two line segments, we need to keep the two
vertical edges in the intersection polygon. The reason is that the
intersection of two parallelograms may shrink the range of the
intersection, as illustrated in Figure 5(c), where parallelogram
$ABCD$ intersects with parallelogram $EFGH$. The left vertical edge is
shrunk into a point $I$ to the right of the original edge.

In summary, we need to record only up to 4 edges to determine 
whether a new point can be compressed together with the points 
seen but not compressed. This immediately leads to the following 
result. 

Theorem 1 (Complexity - PointBound).
The algorithm PointBound for the PLA-PointBound problem has time
complexity $O(n)$ and space complexity $O(1)$, where $n$ is the number
of points in a time series to be compressed. ■

Since algorithm PointBound only looks ahead for one point in 
the data stream to output a line segment whenever necessary in the 
piecewise linear approximation, it is an online algorithm and can 
be applied on data streams. 

3.3 Solving the PLA-SegmentBound Problem 

We first present the following useful observation; a similar result
has been reported in [12] without proof.

LEMMA 2. Suppose that a line segment $s$ approximates a fragment $X$
of $n$ points $x_1, \ldots, x_n$ in a time series. Then, $s$ minimizes
$segerr(s, X)$ if the slope of $s$ is

$$m = \frac{\left(\sum_{i=1}^{n} i\,x_i\right) - \frac{1}{n}\left(\sum_{i=1}^{n} i\right)\left(\sum_{i=1}^{n} x_i\right)}{\left(\sum_{i=1}^{n} i^2\right) - \frac{1}{n}\left(\sum_{i=1}^{n} i\right)^2} \tag{5}$$

and the left endpoint of $s$ has value

$$m + \frac{\sum_{i=1}^{n} (x_i - i \cdot m)}{n}.$$

Proof. Consider a line segment $s$ approximating fragment $X$. Let the
left endpoint of $s$ be $(1, y_1)$ and the slope be $m$. For each
point $x_i$ ($1 \le i \le n$), the error is
$|x_i - \hat{x}_i| = |x_i - y_1 - m(i - 1)|$. Thus,

$$segerr = \sum_{i=1}^{n} (x_i - y_1 - m(i - 1))^2 \tag{6}$$

Clearly, for a given slope $m$, when
$y_1 = m + \frac{\sum_{i=1}^{n} (x_i - i \cdot m)}{n}$, segerr reaches
the minimum value

$$segerr = \sum_{i=1}^{n} x_i^2 + m^2 \sum_{i=1}^{n} i^2 - 2m \sum_{i=1}^{n} x_i\, i - \frac{\left(\sum_{i=1}^{n} (x_i - i \cdot m)\right)^2}{n} \tag{7}$$

From Equation 7, when

$$m = \frac{\left(\sum_{i=1}^{n} i\,x_i\right) - \frac{1}{n}\left(\sum_{i=1}^{n} i\right)\left(\sum_{i=1}^{n} x_i\right)}{\left(\sum_{i=1}^{n} i^2\right) - \frac{1}{n}\left(\sum_{i=1}^{n} i\right)^2},$$

segerr is minimized. ■

Input: a data stream X = x_1, x_2, ... and error bound e;
Output: a list of line segments X^ approximating X
        such that segerr(X, X^) <= e;
Method:
    i = 1; j = 3;
    s = the line segment ((1, x_1), (2, x_2));
    WHILE (1) DO {
        s' = the line segment identified in Lemma 2 to
             compress x_i, ..., x_j;
        IF segerr(s', x_i ... x_j) <= e THEN {
            s = s'; j = j + 1;
        }
        ELSE {
            output s;
            i = j; j = j + 2;
            s = the line segment ((i, x_i), (i + 1, x_{i+1}));
        }
    }

Figure 6: SegmentBound, an online algorithm for the PLA-SegmentBound
problem.

Lemma 2 leads to algorithm SegmentBound, an online algorithm for the
PLA-SegmentBound problem as shown in Figure 6. Suppose
$x_1, \ldots, x_n$ are the points that have not been compressed yet.
When a new point $x_{n+1}$ arrives, we check whether the line segment
identified by Lemma 2 for $x_1, \ldots, x_{n+1}$ can achieve the
segment error bound. If so, then $x_{n+1}$ is added into the buffer,
and the algorithm moves on to the next point in the stream. Otherwise,
the line segment suggested by Lemma 2 for points $x_1, \ldots, x_n$ is
output, and $x_1, \ldots, x_n$ are considered compressed. $x_{n+1}$ is
added into the buffer.

When a new data point $x_{n+1}$ arrives, the left endpoint and the
slope of the line segment suggested by Lemma 2 can be calculated
quickly. Technically, Equations 5 and 7 indicate that we need to
calculate $\sum_{i=1}^{n+1} i$, $\sum_{i=1}^{n+1} x_i$,
$\sum_{i=1}^{n+1} i\,x_i$, $\sum_{i=1}^{n+1} x_i^2$, and
$\sum_{i=1}^{n+1} i^2$. Since we already have $\sum_{i=1}^{n} i$,
$\sum_{i=1}^{n} x_i$, $\sum_{i=1}^{n} i\,x_i$, $\sum_{i=1}^{n} x_i^2$,
and $\sum_{i=1}^{n} i^2$, the addition of the new point only incurs a
constant cost to update the values of $m$ and the left endpoint. This
leads to the following result.
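The constant-cost update can be sketched with five running sums. The sketch below is written under our own assumptions: it uses local indices 1, 2, ... inside each segment and 0-based global indices in the output, it opens a new buffer with the single point that broke the bound, it flushes the last segment when a finite input ends, and the names `fit` and `segment_bound` are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

Segment = Tuple[Tuple[int, float], Tuple[int, float]]


def fit(n, si, sx, sii, six, sxx):
    """Least-squares line over local indices 1..n, computed from running sums.

    Returns (y1, m, err): left endpoint value, slope, sum of squared errors.
    """
    if n == 1:
        return sx, 0.0, 0.0
    m = (six - si * sx / n) / (sii - si * si / n)
    y1 = m + (sx - m * si) / n
    err = sxx + m * m * sii - 2 * m * six - (sx - m * si) ** 2 / n
    return y1, m, err


def segment_bound(stream: Iterable[float], e: float) -> Iterator[Segment]:
    """Greedy segmentation where each segment has squared error at most e."""
    start, n = 0, 0                     # global start index, buffered count
    si = sx = sii = six = sxx = 0.0     # sums of i, x, i^2, i*x, x^2
    y1 = m = 0.0
    for x in stream:
        t = n + 1                       # local index the new point would take
        cand = (t, si + t, sx + x, sii + t * t, six + t * x, sxx + x * x)
        cy, cm, cerr = fit(*cand)
        if n < 2 or cerr <= e:          # one or two points always fit exactly
            n, si, sx, sii, six, sxx = cand
            y1, m = cy, cm
        else:
            yield ((start, y1), (start + n - 1, y1 + m * (n - 1)))
            start += n
            n, si, sx, sii, six, sxx = 1, 1.0, x, 1.0, x, x * x
            y1, m = x, 0.0
    if n > 0:
        yield ((start, y1), (start + n - 1, y1 + m * (n - 1)))
```

Each arrival touches only the five sums, so the work per point and the memory in use do not depend on how many points the current segment already covers. The closed-form error subtracts large sums, so an implementation on real sensor data may need to guard against rounding.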

Theorem 2 (Complexity - SegmentBound).
The algorithm SegmentBound for the PLA-SegmentBound problem has time
complexity $O(n)$ and space complexity $O(1)$, where $n$ is the number
of points in a time series to be compressed. ■

4. OPTIMALITY 

Theorem 3 (PLA-PointBound quality). 
The PointBound algorithm in Section 3.2 produces a minimum number of
segments to compress a time series.

Proof. For a time series $X = x_1, \ldots, x_n$, let
$l = \min\{|\hat{X}|\}$, where $\hat{X}$ is an $e$-PLA approximating
$X$ (i.e., $maxerr(X, \hat{X}) \le e$). We conduct an induction on $l$
to show that algorithm PointBound outputs an $e$-PLA of $l$ line
segments.

(Base case) Consider $l = 1$, i.e., there exists a line segment that
approximates the whole time series. According to Lemma 1,
$poly(1, 2) \cap \cdots \cap poly(1, n) \ne \emptyset$. Thus,
algorithm PointBound finds a line segment $s$ approximating
$x_1, \ldots, x_n$ with $maxerr(s, X) \le e$.

(Induction) Assume that, when $l \le k$, algorithm PointBound finds an
$e$-PLA $\hat{X}$ of $l$ line segments to approximate $X$. Now, let us
consider the case of $l = (k + 1)$, i.e., there exists an optimal
$e$-PLA $Y = \{s_1, \ldots, s_{k+1}\}$ that approximates $X$.

Suppose that $s_1$ approximates $x_1, \ldots, x_m$. Let us assume that
$s'_1$ output by algorithm PointBound approximates points
$x_1, \ldots, x_{m'}$. Due to Lemma 1,
$poly(1, 2) \cap \cdots \cap poly(1, m) \ne \emptyset$. Thus, $s'_1$
must approximate $x_1, \ldots, x_m$ with the quality guarantee, i.e.,
$maxerr(s'_1, x_1 \cdots x_m) \le e$. In other words, $m' \ge m$.

If $m = m'$, then points $x_{m+1}, \ldots, x_n$ in $X$ can be
approximated by an $e$-PLA of $(l - 1) = k$ line segments. According
to the assumption, algorithm PointBound finds an $e$-PLA of $(l - 1)$
line segments approximating $x_{m+1}, \ldots, x_n$.

Suppose that $m' > m$. Since $x_{m+1}, \ldots, x_n$ can be
approximated by an $e$-PLA of $(l - 1)$ line segments, a proper subset
$x_{m'+1}, \ldots, x_n$ must also be approximated by an $e$-PLA of at
most $(l - 1) = k$ line segments. We only need to drop the segments
approximating $x_{m+1}, \ldots, x_{m'}$. According to the assumption,
algorithm PointBound finds an $e$-PLA of the minimum number of line
segments to approximate points $x_{m'+1}, \ldots, x_n$.

In summary, algorithm PointBound finds an $e$-PLA of $l = (k + 1)$
line segments approximating $X$. ■

Similarly, we can also show the optimality of the SegmentBound 
algorithm. 

Theorem 4 (PLA-SegmentBound quality). 
The SegmentBound algorithm in Section 3.3 produces a minimum number
of segments to compress a time series.

Although the number of line segments used to approximate a 
time series is a good measure on the compression quality, it is 
not directly translated to compression ratio. For example, in our 
methods, the endpoints of segments are not constrained. Thus, two 
points are needed to represent a segment. On the other hand, a 
PLA using connecting segments (i.e., two consecutive segments 
share the same endpoint) may use more segments but achieve a 
better compression ratio since only one point is needed to represent 
a segment except for the first segment. 
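The trade-off can be made concrete by counting the points that have to be stored or transmitted. The sketch below uses a hypothetical helper `points_to_transmit`; the segment counts in the example are invented purely for illustration and are not measurements.

```python
def points_to_transmit(num_segments: int, connecting: bool) -> int:
    """Number of endpoints needed to represent a PLA.

    Disjoint segments need two endpoints each; connecting segments share
    endpoints, so only the first segment needs two.
    """
    if num_segments == 0:
        return 0
    return num_segments + 1 if connecting else 2 * num_segments


# Illustrative numbers only: 10 disjoint segments versus 12 connecting ones.
disjoint = points_to_transmit(10, connecting=False)
connected = points_to_transmit(12, connecting=True)
print(disjoint, connected)
```

In this made-up case the connecting PLA uses two more segments but needs 13 points instead of 20.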

Theorem 5 (Compression factor). 
Algorithms PointBound and SegmentBound have an approximation 
factor of 2 to the optimum compression factor that an e-PLA can 
achieve. 

Proof. We only show the case for the PointBound algorithm. The same
argument applies to the SegmentBound algorithm.

For any time series $X$ of $m$ points, suppose that the PointBound
algorithm approximates $X$ using $n$ line segments. Then, according to
Theorem 3, no $e$-PLA can have fewer than $n$ line segments. To
represent $n$ line segments, at least $(n + 1)$ points are needed.
Thus, the optimum compression ratio using PLA is at most
$\alpha_{opt} = \frac{m}{n + 1}$.

The line segments generated by the PointBound algorithm may not be
connecting. Thus, at most $2n$ points are needed to represent the $n$
line segments. The worst case compression ratio of the PointBound
algorithm is $\alpha_{PointBound} = \frac{m}{2n}$. Clearly,
$\frac{\alpha_{opt}}{\alpha_{PointBound}} = \frac{2n}{n + 1} < 2$. ■

Figure 7: An example of zoning angle.

5. PLAZA FOR TINY SENSORS 

Although algorithm PointBound is optimal for the PLA-PointBound
problem, it may still be too computation intensive for tiny, resource-constrained
sensors, for two reasons.

First, algorithm PointBound may generate non-connecting segments,
so that each segment requires the transmission of two endpoints.
As analyzed before, connecting line segments reduce the
data transmission volume since each segment (except the first one)
requires the transmission of only one endpoint. Second, algorithm
PointBound has to calculate intersections of parallelograms. This
computation may be too heavy for tiny, resource-constrained sensor
nodes.

In this section, we design a simple, fast online algorithm PLAZA 
(Piecewise Linear Approximation with Zoning Angle) for the PLA- 
PointBound problem. PLAZA generates connecting line segments. 
Although PLAZA is not optimal in the number of line segments 
used for approximation, it is light in computation and very effective 
in compression ratio, as will be verified by our experiments. 

5.1 PLAZA 

PLAZA builds on the concept of zoning angle. Given an error
bound $\epsilon$ and two points $(i, x_i)$ and $(k, x_k)$ $(i < k)$, the zoning
angle from $(i, x_i)$ to $(k, x_k)$, denoted by $\theta_{(i,k)}$, is defined as
the angle that has $(i, x_i)$ as the endpoint, $((i, x_i), (k, x_k))$ as the
bisector, and has a degree of $2 \arctan \frac{\epsilon}{|x_i x_k|}$, where
$|x_i x_k| = \sqrt{(k - i)^2 + (x_k - x_i)^2}$.

Figure 7(a) shows an example of zoning angle $\theta_{(i,k)}$. The zoning
angle defines a zone to include any potential line segments that can
be used to compress $x_i$ and $x_k$.
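As an illustrative calculation (the numbers are made up, and the degree of the zoning angle is assumed to be $2 \arctan(\epsilon / |x_i x_k|)$ as read from the definition above), take $\epsilon = 0.5$ and the two points $(1, 2)$ and $(4, 6)$:

```latex
\begin{align*}
|x_i x_k| &= \sqrt{(4 - 1)^2 + (6 - 2)^2} = \sqrt{9 + 16} = 5, \\
2 \arctan \frac{\epsilon}{|x_i x_k|} &= 2 \arctan \frac{0.5}{5} = 2 \arctan 0.1
  \approx 0.199 \text{ rad} \approx 11.4^{\circ}.
\end{align*}
```

The farther apart the two points are, the narrower the zoning angle becomes for the same $\epsilon$.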

We observe the following important results. Their proofs are trivial
and are omitted due to space limits.

Lemma 3. For three points $x_i, x_k, x_j$ $(i < k < j)$ in a time
series, the line segment $((i, x_i), (j, x_j))$ approximates $x_k$ with error
up to $\epsilon$ if and only if the line segment $((i, x_i), (j, x_j))$ falls in
the zoning angle $\theta_{(i,k)}$.

Lemma 4. For three points $x_i, x_k, x_j$ $(i < k < j)$ in a time series,
if zoning angle $\theta_{(i,k)}$ has no overlap with zoning angle $\theta_{(i,j)}$,



Input: a data stream $X = x_1, x_2, \ldots$ and error bound $\epsilon$;
Output: an $\epsilon$-PLA $\tilde{X}$ of a list of connecting line
        segments, i.e., $maxerr(X, \tilde{X}) \le \epsilon$;
Method:
i = 1; angle = $\theta_{(1,2)}$;
s = line segment $((1, x_1), (2, x_2))$; j = 3;
WHILE (1) DO {
    angle = angle $\cap$ $\theta_{(i,j)}$;
    IF angle $\neq \emptyset$ THEN {
        IF segment $((i, x_i), (j, x_j))$ falls in angle
        THEN s = line segment $((i, x_i), (j, x_j))$;
        ELSE {
            $\tilde{x}_j$ = the value of the bisector line of
                angle at index j, as shown in Figure 7(b);
            s = the line segment $((i, x_i), (j, \tilde{x}_j))$;
        }
        j = j + 1;
    }
    ELSE {
        output s;
        i = j - 1; $x_i = \tilde{x}_{j-1}$; j = i + 1;
        angle = $\theta_{(i,i+1)}$;
        s = line segment $((i, x_i), (i + 1, x_{i+1}))$;
    }
}



Figure 8: Algorithm PLAZA. 

there does not exist a line segment $s$ with $(i, x_i)$ as the left endpoint
such that $maxerr(s, x_i \cdots x_k \cdots x_j) \le \epsilon$.

Algorithm PLAZA works as follows. Starting from a point $x_i$,
Lemma 3 is used to check if there is a line segment approximating
the points between indexes $i$ and $j$ $(i < j)$. Moreover, Lemma 4 is
used to check if searching further in the time series is futile. The
pseudocode of PLAZA is shown in Figure 8. Since algorithm PLAZA
scans each point in a data stream only once and stores only the zoning
angle and the current approximating segment in main memory,
it clearly has linear time complexity and constant space
complexity.
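The loop in Figure 8 can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: it works on a finite list with 0-based indexes instead of an unbounded stream, represents a zoning angle as an interval of direction angles, assumes the half-angle is $\arctan(\epsilon / |x_i x_k|)$, and assumes $\epsilon$ is small enough that all directions stay within $(-\pi/2, \pi/2)$. The names `zoning_angle` and `plaza` are hypothetical.

```python
import math


def zoning_angle(i, xi, k, xk, eps):
    """Interval (low, high) of direction angles of the zoning angle
    from (i, xi) to (k, xk); assumed half-angle is atan(eps / distance)."""
    centre = math.atan2(xk - xi, k - i)
    half = math.atan(eps / math.hypot(k - i, xk - xi))
    return centre - half, centre + half


def plaza(xs, eps):
    """Return the knots [(index, value), ...] of a connected PLA of xs.
    Consecutive knots are the endpoints of consecutive segments."""
    n = len(xs)
    if n == 0:
        return []
    knots = [(0, float(xs[0]))]
    if n == 1:
        return knots
    i, xi = 0, float(xs[0])                  # left endpoint of current segment
    lo, hi = zoning_angle(i, xi, 1, xs[1], eps)
    end = (1, float(xs[1]))                  # right endpoint of current segment s
    j = 2
    while j < n:
        a, b = zoning_angle(i, xi, j, xs[j], eps)
        new_lo, new_hi = max(lo, a), min(hi, b)
        if new_lo <= new_hi:                 # intersection is non-empty
            lo, hi = new_lo, new_hi
            direct = math.atan2(xs[j] - xi, j - i)
            if lo <= direct <= hi:           # segment to (j, x_j) falls in angle
                end = (j, float(xs[j]))
            else:                            # use the bisector of the angle
                mid = (lo + hi) / 2.0
                end = (j, xi + math.tan(mid) * (j - i))
            j += 1
        else:                                # output s and start a new segment
            knots.append(end)
            i, xi = end                      # new segment connects to the old one
            lo, hi = zoning_angle(i, xi, i + 1, xs[i + 1], eps)
            end = (i + 1, float(xs[i + 1]))
            j = i + 2
    knots.append(end)                        # flush the last segment
    return knots
```

With this interval representation, every original point lies within perpendicular distance $\epsilon$ of the line carrying its segment; whether that matches the error measure of the PLA-PointBound problem depends on the definition given earlier in the paper. Each point is visited once and only `lo`, `hi`, `end`, and the current left endpoint are kept, which mirrors the linear-time, constant-space argument above.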

5.2 Benchmarking PLAZA 

PLAZA creates connecting line segments. Only one point needs to be
transmitted for each line segment except for the first line
segment. This feature distinguishes PLAZA from algorithms PointBound
and SegmentBound. What is the optimal compression that
can be achieved by an $\epsilon$-PLA consisting of only connecting line
segments?

The idea behind the optimal PLAZA benchmark algorithm is
similar to that of algorithm PointBound. The main difference is
that, unlike the PointBound algorithm, we do not start the new segment
with the initial condition $x_i - \epsilon \le y_i \le x_i + \epsilon$, where $y_i$ is
the value of the left endpoint of the new segment. Instead we set
a smaller range on $y_i$ to guarantee the connectivity of two consecutive
segments. Specifically, to decide the range of $y_i$, we use the
last non-empty polygon intersection in the previous point.

We find the optimal solution by a thorough search. Starting
from $x_1$, we try all values of $j$ such that $x_1, \ldots, x_j$ can be approximated
by a line segment with maximal error $\epsilon$. For each
such subset $x_1, \ldots, x_j$, we compute the intersection of parallelograms
$poly(1, 2) \cap \cdots \cap poly(1, j)$, and try to find a line segment
with left endpoint $(j, y_j)$ that can approximate some points
$x_{j+1}, \ldots, x_i$, where $j + 1 \le i$ and $y_j$ is in the range confined by
$poly(1, 2) \cap \cdots \cap poly(1, j)$. By doing so, the first and the second
line segments are connected. We conduct a depth-first search
to find an $\epsilon$-PLA consisting of the minimum number of connecting
line segments. Limited by space, we omit the details here.

The optimal PLAZA benchmark is an offline algorithm: it assumes
that the time series is given and can be scanned multiple times.
Its complexity is far above linear due to the thorough search. This
algorithm is obviously not suitable for online compression of data
streams. It is for comparison purposes only.

6. EXPERIMENTAL EVALUATION 

In this section, we evaluate the performance of our online algorithms
by simulation in Matlab and by real implementation with
MICA2 motes [15].

6.1 Experimental Setting 

We generated two audio files for testing. The first file includes human
voice with a sampling rate of 8 kHz in mono channel. The
second file includes piano music with a sampling rate of 44 kHz
in mono channel. Each file includes 1,000,000 samples, and the
size of each sample is 16 bits. Figures 9 and 10 show the waveform
of the human voice data and the waveform of the piano music, respectively.
It can be seen that the music data is much "smoother"
than the human voice data. We use the files to test the performance
of our online algorithms in bandwidth saving. We measure two
metrics:

1. Sample reduction ratio (inverted compression ratio). It is
defined as the total number of points to represent the $\epsilon$-PLA
divided by the total number of points in the original time
series.

2. Distortion. It is defined as $\frac{\sum_{i=1}^{n} (x_i - \tilde{x}_i)^2}{n}$, where $n$ is the
total number of points in the time series, $x_i$ is the original
value, and $\tilde{x}_i$ is the approximated value of $x_i$.
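The two metrics can be computed as in the following sketch. The function names are hypothetical, and the distortion is taken to be the mean squared difference between original and approximated values, as the definition above reads.

```python
def sample_reduction_ratio(num_pla_points, num_original_points):
    """Points needed to represent the PLA divided by original points."""
    return num_pla_points / num_original_points


def distortion(original, approximated):
    """Mean squared difference between original and approximated values."""
    if len(original) != len(approximated):
        raise ValueError("both series must have the same length")
    n = len(original)
    return sum((x - a) ** 2 for x, a in zip(original, approximated)) / n
```

A smaller sample reduction ratio means more bandwidth saved, while a smaller distortion means the recovered signal is closer to the original.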

In simulation, we apply the online algorithms on the audio files
and measure the sample reduction ratio. Simulation results are reported
in Section 6.2 and Section 6.3. In the test using MICA2
motes, the original audio files are played on a desktop computer
and are monitored and transmitted with a MICA2 mote over a wireless
channel to a laptop computer. More details are provided in
Section 6.4.

6.2 Results on Quality 

6.2.1 Results on Sample Reduction Ratio 

Figures 11 and 12 show the results of algorithms PointBound and
SegmentBound, respectively, with respect to various error bound
values. As shown in the figures, we can obtain a higher bandwidth
saving on piano music than on human voice. By replaying the
audio files recovered from the samples by our algorithms, we perceive
that the recovered human voice is fully recognizable
with the segment error bound up to 0.4,
or with the point error bound up to 0.2. The quality of recovered
piano music is acceptable to us with the segment error bound up to
0.2, or with the point error bound up to 0.1.

Figures 11 and 12 clearly demonstrate significant bandwidth saving.
With the online algorithms, we only need to transmit around
5% of the original sample size for piano music and around 20% of



Figure 9: The waveform of the human voice data (the lower part is in a smaller time scale).

Figure 10: The waveform of the piano music data (the lower part is in a smaller time scale).

Figure 11: The sample reduction ratio of PointBound.

Figure 12: The sample reduction ratio of SegmentBound.

Figure 13: The sample reduction ratio of PLAZA.



the original sample size for human voice. As such, both sound files 
can be transmitted with the current sensor nodes. 
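As a rough check of these figures, the resulting bit rates can be estimated with a short sketch. It assumes that every retained point costs the same 16 bits as a raw sample, which ignores any overhead for transmitting point indexes; the function name is hypothetical.

```python
def required_kbps(sample_rate_hz, bits_per_sample, reduction_ratio):
    """Bit rate in kbit/s when only a fraction of the samples is sent."""
    return sample_rate_hz * bits_per_sample * reduction_ratio / 1000.0


# Human voice: 8 kHz, 16-bit samples, around 20% of samples kept.
voice_kbps = required_kbps(8_000, 16, 0.20)   # about 25.6 kbit/s
# Piano music: 44 kHz, 16-bit samples, around 5% of samples kept.
piano_kbps = required_kbps(44_000, 16, 0.05)  # about 35.2 kbit/s
```

For comparison, the uncompressed streams would need 128 kbit/s and 704 kbit/s, respectively, under the same assumption.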

Figure 13 shows the sample reduction ratio of algorithm PLAZA
with respect to various point error bounds. We can observe a
similar phenomenon as in Figures 11 and 12. With PLAZA, we
perceive that the recovered human voice is fully recognizable with
the (point) error bound up to 0.2, and the quality of recovered piano
music is acceptable to us with the (point) error bound up to 0.1.
From Figure 13, the above qualities correspond to a bandwidth
reduction to nearly 3% of the original data size for piano music
and about 15% of the original data size for human voice.

One interesting phenomenon is that the SegmentBound algorithm
can reduce the sample transmission volume even if the error
bound is set to zero, as shown in Figure 12. This is because in the
audio files, there are some silent periods where the sample values
are close to zero. The SegmentBound algorithm finds a line segment
to approximate those periods. This nice feature, however,
does not exist in the algorithms for the PLA-PointBound problem.
If the error bound is zero, the initial polygon is empty in the PointBound
algorithm, and the degree of the initial feasible angle is zero
in PLAZA, resulting in no sample reduction.
Figure 14: Comparison of the three algorithms on the human voice data set.

Figure 14 compares algorithms PLAZA, PointBound, and SegmentBound
on the human voice data set. The gap between algorithms
PLAZA and PointBound is very small when the error bound
is less than 0.5. Algorithm PointBound leads to more samples than
algorithm PLAZA when the error bound is less than 0.3. The gap
between algorithm SegmentBound and the two algorithms for the
PLA-PointBound problem comes from the fact that, using the same
error bound value, the PLA-SegmentBound problem puts a tighter
error constraint than the PLA-PointBound problem. We observe
a similar performance comparison of the three algorithms on the
piano data set, but omit the figures here due to space limits.

6.2.2 Results on Distortion 

In Figures 15 and 16, we quantitatively show the distortion of
our algorithms on the human voice data set and the piano music
data set, respectively. The overall distortion on human voice is
larger than that on piano music due to the "smoother" waveform in
the music data set. With the same error bound, algorithm PLAZA
has the largest distortion, followed by algorithm PointBound. Algorithm
SegmentBound has the smallest distortion because, for the same
error bound value, the PLA-SegmentBound problem poses a tighter
error constraint than the PLA-PointBound problem. The smaller
distortion, however, comes at the cost of lower bandwidth saving,
as analyzed before.

6.3 Benchmarking PLAZA 

We test the performance of PLAZA compared to the optimal
solution of its kind (i.e., using connecting line segments to tackle
the PLA-PointBound problem). Due to the high complexity of the
PLAZA benchmark method, the audio files are too big to obtain
the optimal results within reasonable time. We therefore use a small
portion of the audio files for this test.

Interestingly, the PLAZA method and the optimal PLAZA bench- 
mark algorithm generate very similar PLA line segments. Audio 
files are usually filled with short silent periods where sample values 
are close to 0. Thus, algorithm PLAZA can obtain line segments 
very similar to those computed by the benchmark algorithm. We 
omit the detailed figures due to space limit. 

6.4 Results on Real Sensors 

We implemented our online algorithms using MICA2
motes [15] from Crossbow Technology Inc. The test bed is illustrated
in Figure 17. A MICA2 mote includes a radio/processor
board and a sensor board. The radio/processor board uses a 900 MHz
radio. The sensor board includes a microphone that can be used
for sampling sound. The interface of the base station is based on
RS232. It acts as a gateway to connect the laptop and the
wireless sensor network. The original audio files are played on a
desktop computer, monitored by a MICA2 mote, and transmitted
over a wireless channel from the MICA2 mote to the base station.

The results on the sample reduction ratio on the real sensor
test bed are close to the simulation results using Matlab. However, the
audio quality obtained using the real test bed is worse than that obtained
in the Matlab simulation. The deterioration in audio quality
is caused by a major restriction of TinyOS [3], the current operating
system in MICA2 motes. The OS does not support multiple
threads and thus it cannot perform radio transmission and sound
sampling concurrently. Due to this limit, when we transmit data
to the base station, the sensor board stops sampling and the sound
during this period is missed, resulting in small silent gaps in the
recovered audio. Nevertheless, we can still recognize the human
speech and the piano music.

The same task can be carried out with the more recent and more
advanced sensor device, MICAz, from the same company. At
a higher price, MICAz sensors support up to 250 Kbps wireless
transmission. This task, however, has never been fulfilled with low-end
devices like MICA2. In this sense, we break the limit of scarce
radio bandwidth and carry out a task that is hard to achieve without
our fast online compression methods.

Figure 17: The test bed using real sensors.

6.5 Evaluation in Other Applications 

Although we only implemented the online algorithms in an acoustic
sensor monitoring system, our algorithms are applicable
to many other application domains, such as electrocardiogram
(ECG) monitoring for patients. We test our algorithms on an ECG
data set. The maximum value in the data set is 2,490 and the minimum
value is −8,190. We test our online algorithms with the error
bound varying from 1 to 100.
Figure 18: Results on an ECG data set.

Figure 18 compares the sample reduction ratio of algorithms
PLAZA, PointBound, and SegmentBound on the ECG data set.



The performance of algorithms PLAZA and PointBound is very
similar. When the error bound is set to over 35, both algorithms
can compress the data down to 10% of the original size. The gap
between algorithms SegmentBound and PointBound comes from
the fact that, using the same error bound, the PLA-SegmentBound
problem puts a tighter error constraint than the PLA-PointBound
problem.

7. CONCLUSION 

In this paper, we tackle the problem of online compression of
data streams in the resource-constrained network environment, where
traditional data compression techniques cannot be applied. Particularly,
we aim at fast piecewise linear approximation (PLA) methods
with quality guarantee. We study two versions of the problem
which explore quality guarantees in different forms. For the error 
bounded PLA problem, we design fast online algorithms running in 
linear time complexity and requiring a constant space cost. The on- 
line algorithms are also optimal in terms of the number of generated 
segments. To meet the needs from tiny, resource-constrained sen- 
sors, we develop another online algorithm that involves very simple 
computation and generates connecting line segments. Our simula- 
tion results and the test using a real sensor test bed demonstrate 
that our fast online linear approximation methods are very effective 
for data stream compression and transmission over low bandwidth 
networks with nodes heavily constrained in computational power. 

Equipped with the insights gained in this study, we see many
application opportunities for our methods. Meanwhile, there are
also some interesting open questions for future work. For example,
one interesting question is how to design an online algorithm that
computes an $\epsilon$-PLA consisting of connecting line segments and has
a guaranteed approximation factor to the optimum.